Slurm (quick guide)
This page is the quickest path to running jobs on the DGX.
Available partitions
| Partition | Max walltime | Intended usage | Command style |
|---|---|---|---|
| interactive10 | 02:00:00 | Interactive debugging and quick tests on one 1g.10gb MIG | srun |
| prod10 | 24:00:00 | Batch jobs on one 1g.10gb MIG | sbatch |
| prod40 | 24:00:00 | Batch jobs on one 3g.40gb MIG | sbatch |
| prod80 | 24:00:00 | Batch jobs on one full A100 80GB GPU | sbatch |
In beginner terms: start with the standard GPU slice (10 GB VRAM; interactive10 or prod10), and move to the large slice (40 GB VRAM; prod40) or the full GPU (80 GB VRAM; prod80) only if your model does not fit.
The scheduler applies partition defaults for GPU type, task count and CPU count, so beginner commands can stay short.
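The sizing rule above can be written down as a small helper. This is only an illustrative sketch: the name `pick_partition` is hypothetical, and the thresholds are simply the slice sizes from the partition table (10, 40 and 80 GB), ignoring any memory the driver or framework reserves for itself.

```python
# Hypothetical helper: map an estimated GPU memory need (in GB) to the
# smallest batch partition listed on this page that can hold it.
PARTITION_VRAM_GB = [("prod10", 10), ("prod40", 40), ("prod80", 80)]


def pick_partition(required_gb: float) -> str:
    """Return the smallest batch partition whose GPU slice has enough VRAM."""
    for name, vram_gb in PARTITION_VRAM_GB:
        if required_gb <= vram_gb:
            return name
    raise ValueError(f"{required_gb} GB exceeds the largest slice (80 GB)")


if __name__ == "__main__":
    print(pick_partition(8))
    print(pick_partition(24))
```

In practice the estimate should leave some headroom, since a model that nominally needs 10 GB is unlikely to fit in a 10 GB slice.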
Basic workflow
1) Quick Python test in the current shell (CPU only, resource-limited)
You can run Python directly after login for quick checks:
python3 -c "import sys; print(sys.version)"
python3 my_script.py
This login environment is resource-limited by policy. Use it only for small tests and setup tasks.
2) Create and use a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy torch
Then run Python as usual:
python train.py
3) Open an interactive GPU session for debugging
srun -p interactive10 --time=01:00:00 --pty bash
Inside that shell, activate your virtual environment and run tests:
source venv/bin/activate
python train.py
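Before starting a long debugging run, it can help to confirm that the shell really is inside a Slurm allocation and not still on the login environment. A minimal sketch, assuming Slurm exports `SLURM_JOB_ID` and `SLURM_JOB_PARTITION` into the job environment, and that the GPU plugin sets `CUDA_VISIBLE_DEVICES` (this last point depends on site configuration and is an assumption here). The name `describe_allocation` is hypothetical.

```python
import os


def describe_allocation(env=os.environ) -> str:
    """Summarise the Slurm allocation visible in the given environment."""
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        # No job ID usually means we are still in the login shell.
        return "not inside a Slurm job (login shell?)"
    partition = env.get("SLURM_JOB_PARTITION", "unknown")
    # Assumption: the site's GPU plugin exports this variable for MIG slices.
    gpus = env.get("CUDA_VISIBLE_DEVICES", "not set")
    return f"job {job_id} on partition {partition}, CUDA_VISIBLE_DEVICES={gpus}"


if __name__ == "__main__":
    print(describe_allocation())
```

Run inside the `srun` shell, this should report the job ID and `interactive10`; run on the login shell, it should report that no job is active.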
4) Launch production jobs with sbatch
Recommended workflow (start with prod10):
cd ~/my_project
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch
sbatch job.sbatch
For reference, a minimal one-line submit also works:
sbatch -p prod10 --time=04:00:00 --wrap="bash -lc 'source venv/bin/activate && python train.py'"
slurm-prod10.sbatch is a basic template that is copied into new home directories from /etc/skel.
If your account was created before the template was added, copy it manually once:
cp /etc/skel/slurm-prod10.sbatch ~/
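The contents of `slurm-prod10.sbatch` are not reproduced on this page. As a rough idea of what a batch script of this kind usually contains, the sketch below renders one from Python. The `#SBATCH` options used (`--partition`, `--time`, `--job-name`, `--output`) are standard sbatch options; the function name `render_sbatch`, the default job name, and the log file pattern are illustrative assumptions, and the installed template may differ.

```python
def render_sbatch(partition: str, walltime: str, command: str,
                  job_name: str = "train") -> str:
    """Return the text of a minimal batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --output=slurm-%j.out",  # %j expands to the job ID
        "",
        # Assumption: the virtual environment lives in ./venv, as in step 2.
        "source venv/bin/activate",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("prod10", "04:00:00", "python train.py"))
```

The printed text could be saved as `job.sbatch` and submitted with `sbatch job.sbatch`. Because the partition defaults already set GPU type, task count and CPU count, the sketch does not request them explicitly.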
Basic Slurm commands
sinfo # list partitions and node states
squeue -u $USER # list your jobs
scontrol show job <jobid> # inspect one job
scancel <jobid> # cancel one job
sacct -j <jobid> # job accounting/history
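For scripted monitoring, the accounting output can be parsed instead of read by eye. A minimal sketch, assuming `sacct` is called with `--parsable2 --noheader --format=JobID,State,Elapsed` so that each line is pipe-delimited with exactly three fields. The name `parse_sacct` and the sample text are made up for illustration and are not real output from this cluster.

```python
def parse_sacct(text: str) -> list:
    """Parse pipe-delimited sacct lines (JobID|State|Elapsed) into dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        job_id, state, elapsed = line.split("|")
        rows.append({"job_id": job_id, "state": state, "elapsed": elapsed})
    return rows


# Invented sample in the assumed format: one job plus its batch step.
SAMPLE = "1234|COMPLETED|00:12:05\n1234.batch|COMPLETED|00:12:05\n"

if __name__ == "__main__":
    for row in parse_sacct(SAMPLE):
        print(row["job_id"], row["state"], row["elapsed"])
```

The input text could come from `subprocess.run(["sacct", "-j", jobid, "--parsable2", "--noheader", "--format=JobID,State,Elapsed"], capture_output=True, text=True).stdout`, where `jobid` is the ID printed by `sbatch`.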
Notes
- interactive10 accepts interactive jobs (srun). prod10, prod40, prod80 are batch-oriented and should be used with sbatch.
- For exact technical policy (GRES mapping, defaults, limits, quality of service), see Advanced partitions.
- For the current physical/logical GPU split on this DGX, see GPU and MIG layout.